Utilizing machine learning techniques to generate value from a pulp sensibility data set. Supervised learning algorithms are applied to the classification problem of predicting whether a patient needs a supplement, and to help identify the root cause of the problem.
We collected data on 128 patients with 20 features, including the binary target feature, Need Supliment (spelled as in the source file), which records whether or not a patient needs a supplement.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv("Pulp Sensibility.csv")
df.drop_duplicates(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 0 to 126
Data columns (total 20 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Patient                                   127 non-null    object
 1   Age                                       127 non-null    int64
 2   Dental History                            127 non-null    int64
 3   Medical History                           127 non-null    object
 4   Pain (VAS)                                127 non-null    int64
 5   Pain ( Duration) days                     127 non-null    int64
 6   Percussion                                127 non-null    int64
 7   Palpation                                 127 non-null    int64
 8   Mobility                                  127 non-null    int64
 9   PDL involvement                           127 non-null    int64
 10  Curved Canal                              127 non-null    int64
 11  Pulp stone or and Calcification           127 non-null    int64
 12  PDL space                                 127 non-null    int64
 13  Lamina Dura                               127 non-null    object
 14  Cold test ( VAS) Before anaesthesia       127 non-null    int64
 15  Cold test (Duration) Before anaesthesia   127 non-null    int64
 16  EPT ( VAS) before anaesthesia             127 non-null    int64
 17  EPT current pass                          127 non-null    int64
 18  EPT (Duration) before anaesthesia         127 non-null    int64
 19  Need Supliment                            127 non-null    int64
dtypes: int64(17), object(3)
memory usage: 20.8+ KB
df.head()
| | Patient | Age | Dental History | Medical History | Pain (VAS) | Pain ( Duration) days | Percussion | Palpation | Mobility | PDL involvement | Curved Canal | Pulp stone or and Calcification | PDL space | Lamina Dura | Cold test ( VAS) Before anaesthesia | Cold test (Duration) Before anaesthesia | EPT ( VAS) before anaesthesia | EPT current pass | EPT (Duration) before anaesthesia | Need Supliment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 37 | 0 | 0 | 6 | 30 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | LOSS | 0 | 0 | 5 | 32 | 5 | 1 |
| 1 | F | 47 | 0 | 0 | 2 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 80 | 3 | 1 |
| 2 | F | 27 | 0 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 23 | 5 | 27 | 19 | 1 |
| 3 | M | 27 | 0 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 27 | 5 | 21 | 37 | 1 |
| 4 | M | 23 | 0 | 0 | 4 | 60 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 12 | 2 | 43 | 5 | 1 |
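Several column names in this file carry stray spaces (e.g. 'Percussion ' and 'Curved Canal ' in the selections further below, and 'Pain ( Duration) days'). A minimal sketch of one way to normalize them with `str.strip()`, shown on a toy frame; this cleanup is not applied in the cells that follow, which keep the original names:

```python
import pandas as pd

# Toy frame mimicking the trailing-space column names seen in this
# data set; the names here are illustrative only.
toy = pd.DataFrame({"Percussion ": [0, 1], "Curved Canal ": [1, 0]})

# Strip surrounding whitespace so a later selection such as
# toy['Percussion'] cannot fail on a hidden trailing space.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))  # -> ['Percussion', 'Curved Canal']
```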
A data profile report is generated below to explore the contents of the collected data set.
# pandas_profiling has been renamed; its import-time deprecation warning
# advises using ydata_profiling instead
from ydata_profiling import ProfileReport
# raw_data = pd.read_csv("Pulp Sensibility.csv", index_col=False)
# df_profile = df.copy()
# Create and display a report summarizing the data in the Pulp Sensibility data set
profile = ProfileReport(df,
title='Pulp Sensibility Data Profile Report',
html={'style': {
'full_width': True
}})
profile.to_notebook_iframe()
df_cat = df[['Patient', 'Dental History', 'Medical History',
'Percussion ', 'Palpation', 'Mobility',
'PDL involvement', 'Curved Canal ', 'Pulp stone or and Calcification',
'PDL space', 'Lamina Dura',
'Need Supliment']]
df_cat.head()
| | Patient | Dental History | Medical History | Percussion | Palpation | Mobility | PDL involvement | Curved Canal | Pulp stone or and Calcification | PDL space | Lamina Dura | Need Supliment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 0 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | LOSS | 1 |
| 1 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | M | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
df_num = df.drop(df_cat.columns, axis=1)
df_num.head()
| | Age | Pain (VAS) | Pain ( Duration) days | Cold test ( VAS) Before anaesthesia | Cold test (Duration) Before anaesthesia | EPT ( VAS) before anaesthesia | EPT current pass | EPT (Duration) before anaesthesia |
|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 6 | 30 | 0 | 0 | 5 | 32 | 5 |
| 1 | 47 | 2 | 30 | 3 | 3 | 0 | 80 | 3 |
| 2 | 27 | 6 | 7 | 7 | 23 | 5 | 27 | 19 |
| 3 | 27 | 6 | 7 | 7 | 27 | 5 | 21 | 37 |
| 4 | 23 | 4 | 60 | 5 | 12 | 2 | 43 | 5 |
fig = px.bar(df['Need Supliment'].value_counts(),
             color=df['Need Supliment'].value_counts().index,
             text_auto=True,
             labels=dict(index="Requirement of Supplement",
                         value="Total Number of Patients"))
fig.show()
Observations:
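The class balance shown above also fixes the baseline any classifier must beat: always predicting the majority class. A small sketch with a hypothetical target vector (the real counts come from the `value_counts()` bar chart above):

```python
import pandas as pd

# Hypothetical target vector standing in for df['Need Supliment']
y = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])

counts = y.value_counts()
baseline_acc = counts.max() / counts.sum()  # majority-class accuracy
print(counts.to_dict(), round(baseline_acc, 3))  # -> {1: 6, 0: 2} 0.75
```

Any model whose test accuracy does not clear this baseline has learned nothing beyond the class prior.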
def bar_chart(feature):
    # Count each value of the feature separately for patients who did and
    # did not need a supplement, then plot the two distributions side by side
    suppliment_need = df[df['Need Supliment'] == 1][feature].value_counts()
    no_suppliment = df[df['Need Supliment'] == 0][feature].value_counts()
    df_view = pd.DataFrame([suppliment_need, no_suppliment])
    df_view.index = ['Supplement (Required)', 'Supplement (Not Required)']
    fig = px.bar(df_view, barmode='group', text_auto=True,
                 labels=dict(index="Requirement of Supplement",
                             value="Total Number of Patients"))
    fig.show()
bar_chart('Patient')
Observations:
bar_chart('Medical History')
Observations:
bar_chart('Dental History')
Observations:
bar_chart('Pulp stone or and Calcification')
Observations:
fig = px.histogram(df,x='Age',text_auto=True)
fig.show()
Observations:
# numeric_only=True is needed because df still contains object columns at this point
fig = px.bar(data_frame=df.groupby(['Need Supliment']).mean(numeric_only=True).reset_index(),
             x="Need Supliment", y="Age", text_auto=True)
fig.show()
Observations:
fig = px.bar(data_frame=df.groupby(['Need Supliment']).mean(numeric_only=True).reset_index(),
             x="Need Supliment", y="Pain ( Duration) days", text_auto=True)
fig.show()
Observations:
fig = px.bar(data_frame=df.groupby(['Need Supliment']).mean(numeric_only=True).reset_index(),
             x="Need Supliment", y='EPT (Duration) before anaesthesia ', text_auto=True)
fig.show()
Observations:
df['Lamina Dura'] = df['Lamina Dura'].map({'0':0,'LOSS':1})
df['Patient'] = df['Patient'].map({'M':1,'F':0})
df.head()
| | Patient | Age | Dental History | Medical History | Pain (VAS) | Pain ( Duration) days | Percussion | Palpation | Mobility | PDL involvement | Curved Canal | Pulp stone or and Calcification | PDL space | Lamina Dura | Cold test ( VAS) Before anaesthesia | Cold test (Duration) Before anaesthesia | EPT ( VAS) before anaesthesia | EPT current pass | EPT (Duration) before anaesthesia | Need Supliment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37 | 0 | 0 | 6 | 30 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 5 | 32 | 5 | 1 |
| 1 | 0 | 47 | 0 | 0 | 2 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 80 | 3 | 1 |
| 2 | 0 | 27 | 0 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 23 | 5 | 27 | 19 | 1 |
| 3 | 1 | 27 | 0 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 27 | 5 | 21 | 37 | 1 |
| 4 | 1 | 23 | 0 | 0 | 4 | 60 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 12 | 2 | 43 | 5 | 1 |
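A quick sanity check on the binary mappings above: `Series.map` silently turns any value missing from the mapping dict into NaN, so counting NaNs after mapping catches unexpected spellings in the raw column. A toy sketch (the values here are illustrative, mimicking the 'Lamina Dura' column):

```python
import pandas as pd

# Toy column standing in for df['Lamina Dura'] ('0' / 'LOSS' strings)
s = pd.Series(['0', 'LOSS', '0', 'LOSS'])

encoded = s.map({'0': 0, 'LOSS': 1})
# Any value absent from the dict would become NaN and show up here
print(encoded.tolist(), encoded.isna().sum())  # -> [0, 1, 0, 1] 0
```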
clean_df = df.copy()
cat_features = clean_df.select_dtypes('object').columns
clean_df = pd.concat([clean_df.drop(cat_features, axis = 1),
pd.get_dummies(clean_df[cat_features])], axis = 1)
clean_df.head()
| | Patient | Age | Dental History | Pain (VAS) | Pain ( Duration) days | Percussion | Palpation | Mobility | PDL involvement | Curved Canal | ... | Medical History_0 | Medical History_CARD | Medical History_DM | Medical History_DM, HTN | Medical History_DM, HTN, CAD | Medical History_HT0 | Medical History_HT0, DM | Medical History_HTN | Medical History_HTN, CARD | Medical History_HTN, DM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37 | 0 | 6 | 30 | 2 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 47 | 0 | 2 | 30 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 27 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 27 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 23 | 0 | 4 | 60 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 29 columns
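One detail of the one-hot encoding above: `pd.get_dummies` keeps an indicator column for every category level, which is harmless for tree models but leaves one redundant, collinear column for linear models such as logistic regression. A hedged sketch of the `drop_first=True` alternative on a toy column standing in for 'Medical History':

```python
import pandas as pd

# Toy categorical column; the levels are a small subset of those
# seen in the real 'Medical History' column.
s = pd.Series(['0', 'DM', 'HTN', 'DM'])

# drop_first=True drops the first (alphabetically sorted) level, '0',
# so the remaining indicators are linearly independent
dummies = pd.get_dummies(s, prefix='Medical History', drop_first=True)
print(list(dummies.columns))  # -> ['Medical History_DM', 'Medical History_HTN']
```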
df2 = pd.DataFrame(clean_df.corrwith(df['Need Supliment']).sort_values(ascending=False))
df2 = df2.set_axis(['Correlation Coefficient'], axis=1).reset_index().rename(columns={'index': 'Features'})
df2
| | Features | Correlation Coefficient |
|---|---|---|
| 0 | Need Supliment | 1.000000 |
| 1 | Pulp stone or and Calcification | 0.422838 |
| 2 | EPT (Duration) before anaesthesia | 0.289927 |
| 3 | Pain ( Duration) days | 0.285700 |
| 4 | Age | 0.273753 |
| 5 | EPT ( VAS) before anaesthesia | 0.251071 |
| 6 | Cold test (Duration) Before anaesthesia | 0.247909 |
| 7 | EPT current pass | 0.239994 |
| 8 | Cold test ( VAS) Before anaesthesia | 0.231929 |
| 9 | Palpation | 0.191575 |
| 10 | Dental History | 0.191014 |
| 11 | Mobility | 0.174159 |
| 12 | Medical History_DM | 0.171354 |
| 13 | Pain (VAS) | 0.161199 |
| 14 | Percussion | 0.159934 |
| 15 | Medical History_DM, HTN | 0.129024 |
| 16 | Medical History_HTN, CARD | 0.129024 |
| 17 | Medical History_HTN | 0.099893 |
| 18 | Curved Canal | 0.094392 |
| 19 | Medical History_DM, HTN, CAD | 0.063740 |
| 20 | Medical History_HTN, DM | 0.063740 |
| 21 | Patient | 0.033549 |
| 22 | PDL space | -0.003579 |
| 23 | Lamina Dura | -0.004864 |
| 24 | PDL involvement | -0.041200 |
| 25 | Medical History_CARD | -0.043146 |
| 26 | Medical History_HT0 | -0.124515 |
| 27 | Medical History_HT0, DM | -0.124515 |
| 28 | Medical History_0 | -0.239033 |
transformed_df = clean_df[df2[df2['Correlation Coefficient']>= 0.15]['Features'].to_list()]
transformed_df.head()
| | Need Supliment | Pulp stone or and Calcification | EPT (Duration) before anaesthesia | Pain ( Duration) days | Age | EPT ( VAS) before anaesthesia | Cold test (Duration) Before anaesthesia | EPT current pass | Cold test ( VAS) Before anaesthesia | Palpation | Dental History | Mobility | Medical History_DM | Pain (VAS) | Percussion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 5 | 30 | 37 | 5 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 6 | 2 |
| 1 | 1 | 0 | 3 | 30 | 47 | 0 | 3 | 80 | 3 | 0 | 0 | 0 | 0 | 2 | 0 |
| 2 | 1 | 0 | 19 | 7 | 27 | 5 | 23 | 27 | 7 | 0 | 0 | 0 | 0 | 6 | 0 |
| 3 | 1 | 0 | 37 | 7 | 27 | 5 | 27 | 21 | 7 | 0 | 0 | 0 | 0 | 6 | 0 |
| 4 | 1 | 0 | 5 | 60 | 23 | 2 | 12 | 43 | 5 | 0 | 0 | 0 | 0 | 4 | 1 |
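Note that the filter above applies a one-sided cut (signed coefficient >= 0.15), so negatively correlated features such as Medical History_0 (-0.239) are discarded even though they carry comparable signal. A small sketch of the absolute-value variant on a toy frame (names and threshold are illustrative):

```python
import pandas as pd

# Toy frame: 'a' correlates positively with the target, 'b' negatively,
# 'c' not at all; an abs() cut keeps both 'a' and 'b'
df_toy = pd.DataFrame({'target': [0, 0, 1, 1],
                       'a': [1, 2, 3, 4],
                       'b': [4, 3, 2, 1],
                       'c': [1, 2, 1, 2]})

corr = df_toy.corrwith(df_toy['target']).abs()
selected = corr[corr >= 0.5].index.tolist()
print(sorted(selected))  # -> ['a', 'b', 'target']
```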
# Import the scikit-learn function used to split the data set
from sklearn.model_selection import train_test_split
# Designate the target feature as "y" and the explanatory features as "x"
y = transformed_df['Need Supliment']
x = transformed_df.drop('Need Supliment', axis=1)
# Create the train and test sets for x and y and specify a random_state seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 87)
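With only 127 rows and an imbalanced target, a plain random split can leave the test set with a class ratio quite different from the full data. Passing `stratify=y` to `train_test_split` preserves the ratio in both splits; a sketch on toy data (not the split used above):

```python
from sklearn.model_selection import train_test_split

# Toy data: 5 positives and 5 negatives
X = [[i] for i in range(10)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# stratify=y guarantees the 50/50 ratio survives in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4,
                                          random_state=87, stratify=y)
print(sorted(y_te))  # -> [0, 0, 1, 1]
```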
GridSearch cross-validation for the logistic regression model is performed below.
# Import the scikit-learn functions and classes necessary to perform cross-validation
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
# Import the functions used to save and load a trained model
from joblib import dump, load
# Import the scikit-learn class used to train a logistic regression model
from sklearn.linear_model import LogisticRegression
# Create a pipeline specifying all of the operations to perform when training the model
# In this case, the pipeline consists of z-score standardization and fitting of a logistic regression model
pipeline_lr = make_pipeline(preprocessing.StandardScaler(), LogisticRegression(max_iter = 150))
# Specify the hyperparameters and their corresponding values that are to be used in GridSearch
hyperparameters_lr = { 'logisticregression__C' : [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1] }
# Initialize the GridSearch cross-validation object, specifying 10 folds for 10-fold cross-validation and
# "f1" and "accuracy" as the evaluation metrics for cross-validation scoring
logistic_regression = GridSearchCV(pipeline_lr, hyperparameters_lr, cv = 10, scoring = ['f1', 'accuracy'],
refit = 'f1', verbose = 0, n_jobs = -1)
# Train and cross-validate the logistic regression model and ignore the function output
_ = logistic_regression.fit(x_train, y_train)
# Save the model so it can be used again without retraining it
_ = dump(logistic_regression, 'logistic_regression.joblib')
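After fitting, the winning hyperparameter value and its cross-validated score are available on the `GridSearchCV` object via `best_params_` and `best_score_`. A self-contained sketch on synthetic data (standing in for x_train/y_train, which come from the CSV):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=120, n_features=8, random_state=87)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=150))
grid = GridSearchCV(pipe, {'logisticregression__C': [0.1, 1]},
                    cv=5, scoring='f1')
grid.fit(X, y)

# The selected C is one of the candidates in the grid
print(grid.best_params_['logisticregression__C'] in (0.1, 1))  # -> True
```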
GridSearch cross-validation for the KNN model is performed below.
# Import the scikit-learn class used to implement a KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Create a pipeline specifying all of the operations to perform when training the model
# In this case, the pipeline consists of z-score standardization and initialization of a KNN classifier
pipeline_knn = make_pipeline(preprocessing.StandardScaler(), KNeighborsClassifier(algorithm = 'ball_tree'))
# Specify the hyperparameters and their corresponding values that are to be used in GridSearch
hyperparameters_knn = { 'kneighborsclassifier__n_neighbors' : [3, 5] }
# Initialize the GridSearch cross-validation object, specifying 5 folds for 5-fold cross-validation and
# "f1" and "accuracy" as the evaluation metrics for cross-validation scoring
knn = GridSearchCV(pipeline_knn, hyperparameters_knn, cv = 5, scoring = ['f1', 'accuracy'],
refit = 'f1', verbose = 0, n_jobs = -1)
# Cross-validate the KNN model and ignore the function output
_ = knn.fit(x_train, y_train)
# Save the model so it can be used again without redefining it
_ = dump(knn, 'knn.joblib')
Having trained and cross-validated the models, I then used the models to make predictions on the test set. I evaluated the performance of the models on the test set using the same F1 and accuracy metrics used to evaluate the models during cross-validation. The performance of the models as indicated by these metrics is displayed below.
# Import the scikit-learn functions used to calculate the F1 score and accuracy on the test set
from sklearn.metrics import f1_score, accuracy_score
# Use the best logistic regression model to make predictions on the test set
y_test_pred_lr = logistic_regression.predict(x_test)
# Display the cross-validated (train) and test-set F1 and accuracy for the logistic regression model
print('Logistic regression F1 (train):',
round(logistic_regression.cv_results_['mean_test_f1'][logistic_regression.best_index_], 3))
print('Logistic regression F1 (test):', round(f1_score(y_test, y_test_pred_lr), 3), '\n')
print('Logistic regression accuracy (train):',
round(logistic_regression.cv_results_['mean_test_accuracy'][logistic_regression.best_index_], 3))
print('Logistic regression accuracy (test):',
round(accuracy_score(y_test, y_test_pred_lr), 3), '\n')
# Use the best KNN model to make predictions on the test set
y_test_pred_knn = knn.predict(x_test)
# Display the cross-validated (train) and test-set F1 and accuracy for the KNN model
print('KNN F1 (train):',
round(knn.cv_results_['mean_test_f1'][knn.best_index_], 3))
print('KNN F1 (test):', round(f1_score(y_test, y_test_pred_knn), 3), '\n')
print('KNN accuracy (train):',
round(knn.cv_results_['mean_test_accuracy'][knn.best_index_], 3))
print('KNN accuracy (test):',
round(accuracy_score(y_test, y_test_pred_knn), 3), '\n')
Logistic regression F1 (train): 0.761
Logistic regression F1 (test): 0.895
Logistic regression accuracy (train): 0.703
Logistic regression accuracy (test): 0.846

KNN F1 (train): 0.759
KNN F1 (test): 0.865
KNN accuracy (train): 0.712
KNN accuracy (test): 0.808
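Aggregate F1 and accuracy hide which kind of error a model makes; on clinical data the cost of a missed supplement (false negative) and an unnecessary one (false positive) may differ. A confusion matrix separates the two. A sketch with hypothetical labels and predictions (the real ones are y_test and y_test_pred_lr above):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# ravel() on the 2x2 matrix yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 4
```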
To objectively determine the degree of bias and variance exhibited by the models, I used the guidelines presented below.
Bias: